
Support for Chinese, Japanese, Korean #13

Open

patrick-wilken wants to merge 9 commits into main from feature/cjk_support

Conversation

patrick-wilken (Collaborator) commented Sep 3, 2025

Because it performs no word tokenization, SubER so far does not produce meaningful scores when computed directly on "scriptio continua" languages like Chinese and Japanese. The same is true of many of the other available metrics.

Zhou and Yoshinaga noticed and rightfully criticized this in a recent paper.

It's already possible to tokenize the subtitle files as a preprocessing step and then call SubER. This is apparently also what was done in the paper, using "SacreBLEU's TER tokenizer" (TercomTokenizer with "asian_support"?). But this is inconvenient, and, more importantly, tokenization should be standardized by making it part of the evaluation tool, in particular because different metrics require different tokenization (e.g. BLEU vs. TER), including no tokenization at all (chrF and CER).
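For reference, the preprocessing workaround looks roughly like this (a sketch, not what the paper necessarily did; file names are placeholders, and the import path is sacrebleu 2.x):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

tokenizer = TercomTokenizer(asian_support=True)

# Tokenize a plain-text subtitle file line by line before passing it to SubER.
with open("hyp.txt", encoding="utf-8") as f_in, \
        open("hyp.tok.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(tokenizer(line.rstrip("\n")) + "\n")
```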

I have implemented word tokenization here for all the metrics where it is necessary, but some of my choices still need to be checked. For the sacrebleu metrics (BLEU, TER, chrF) the situation is fairly clear: we want to pass the correct language information to sacrebleu but otherwise use its implementation unchanged, even if it potentially does not handle everything correctly (a sketch of this setup follows below).
For WER, jiwer does not provide built-in tokenization, so we have a free choice. For now I'm using the TercomTokenizer with "asian_support", because SubER is based on the TER implementation, and WER and TER are closely related, so this is the most consistent option. However, I have my doubts that the TercomTokenizer does a good job for Japanese. MeCab, as used for BLEU, seems to be a much more common and better choice? I will need to talk to native speakers and to colleagues experienced in evaluating Japanese ASR to confirm.
The same applies to SubER itself. Most consistent would be to use the TercomTokenizer, so that TER and SubER scores stay closely related for these languages too. But MeCab might be better for Japanese. For Chinese I'm already quite confident it works well after these changes: I get very similar scores to applying character splitting to the input subtitle files.
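In code, the intended setup is roughly the following (a sketch, not the actual diff; `trg_lang="ja"` makes sacrebleu pick its "ja-mecab" tokenizer, which requires sacrebleu's MeCab extra to be installed):

```python
import jiwer
from sacrebleu.metrics import BLEU, CHRF, TER
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer

hyps = ["これはペンです"]    # placeholder data
refs = [["これはペンです"]]

bleu = BLEU(trg_lang="ja")     # selects the MeCab tokenizer internally
ter = TER(asian_support=True)  # character-level handling of CJK ideographs
chrf = CHRF()                  # character-based, no word tokenization needed

print(bleu.corpus_score(hyps, refs).score)
print(ter.corpus_score(hyps, refs).score)
print(chrf.corpus_score(hyps, refs).score)

# WER: jiwer has no built-in tokenizer, so tokenize both sides up front
# and let jiwer compute the edit distance on space-separated words.
tokenize = TercomTokenizer(asian_support=True)
print(jiwer.wer(tokenize(refs[0][0]), tokenize(hyps[0])))
```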

For Korean I'm generally not sure how much of a tokenizer is needed. All the subtitle files I have seen use spaces, and the "asian_support" option of the TercomTokenizer does nothing to Korean text, as far as I can see (see the sketch below). But we should definitely pass "ko" as the language code to sacrebleu so that BLEU makes use of the MeCab-ko tokenizer. I'm assuming it is supposed to improve handling of the rich morphology? Then it might make sense to use it for SubER as well. I will need to talk to people with language expertise here as well.
(I might already be embarrassing myself with the Chinese/Japanese unit tests I wrote. 😄)
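A quick way to check the Korean behaviour (a sketch; the module path and class name are my reading of sacrebleu 2.x, and "ko-mecab" needs the mecab-ko extra installed):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer
from sacrebleu.tokenizers.tokenizer_ko_mecab import TokenizerKoMecab

sentence = "안녕하세요 여러분"  # placeholder Korean text

print(TercomTokenizer(asian_support=True)(sentence))  # expected: unchanged
print(TokenizerKoMecab()(sentence))                   # expected: morpheme-split
```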

Hypothesis-to-reference alignment for the AS- and t- metrics is still missing. It should be done using the same tokenizer as for SubER, but the tokenization needs to be reversed before computing the metrics (which may or may not tokenize again themselves).
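The "reversing" step is the non-obvious part for CJK: spaces between two CJK characters are artifacts of the tokenization and must be removed, while spaces next to Latin words are real and must be kept. A minimal sketch (not the actual implementation; the character ranges cover only the common kana and CJK ideograph blocks):

```python
import re

# Hiragana/Katakana, CJK Extension A, CJK Unified, CJK Compatibility.
_CJK = "[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]"

def detokenize_cjk(text: str) -> str:
    # Remove a space only when it sits between two CJK characters.
    return re.sub(f"({_CJK}) (?={_CJK})", r"\1", text)

print(detokenize_cjk("これ は ペン です"))  # -> これはペンです
```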

This is a breaking change, but should be considered a bug fix.
Still to be checked: maybe better tokenization than TercomTokenizer is needed for SubER, especially for Japanese?
Hypothesis-to-reference alignment for the AS- and t- metrics is still missing.
patrick-wilken (Collaborator, Author) commented

Hypothesis-to-reference alignment is now implemented, so AS-BLEU, t-BLEU etc. can be calculated correctly for Japanese, Chinese and Korean.

Also, for SubER and WER I switched to the tokenizers sacrebleu uses for these languages, namely MeCab ("ja-mecab") and "zh", instead of the TercomTokenizer with "asian_support". The main reason is that the TercomTokenizer does not split sequences of Hiragana and Katakana characters at all.
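To illustrate the difference (a sketch; module paths as in sacrebleu 2.x, and the MeCab tokenizer needs sacrebleu's ja extra):

```python
from sacrebleu.tokenizers.tokenizer_ter import TercomTokenizer
from sacrebleu.tokenizers.tokenizer_ja_mecab import TokenizerJaMecab

line = "これはペンです"  # placeholder Japanese text, kana only

print(TercomTokenizer(asian_support=True)(line))  # kana run stays one token
print(TokenizerJaMecab()(line))                   # e.g. これ は ペン です
```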

I manually tested this on multiple Japanese and Chinese files, and I get the same SubER scores as when applying those tokenizers to hypothesis and reference before calling the SubER tool*. There are slight tokenization differences when applying MeCab to space-separated words/parts, as the SubER code does, compared to applying it to full subtitle lines. But the score difference is <0.2 SubER points, and neither method is more correct, I guess.

*(Note that to reproduce exact SubER scores - as opposed to SubER-cased - one also has to remove punctuation from the tokenized input files; otherwise those pure punctuation tokens will not be normalized away during the SubER calculation.)
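For reference, the punctuation filtering in that footnote amounts to something like this (a sketch with a hypothetical helper name, not part of the PR; uses only the standard library):

```python
import unicodedata

def drop_punctuation_tokens(line: str) -> str:
    """Remove tokens that consist of punctuation characters only."""
    def is_punct(token: str) -> bool:
        return all(unicodedata.category(ch).startswith("P") for ch in token)
    return " ".join(token for token in line.split() if not is_punct(token))

print(drop_punctuation_tokens("これ は ペン です 。"))  # drops the "。"
```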

I also tested that running the align_hyp_to_ref tool and then running sacrebleu on the aligned hypothesis gives the same scores as calling the SubER tool to compute AS-BLEU (`python3 -m suber -l {ja,zh} -m AS-BLEU [...]`).

So I'm quite confident that the implementation is correct, despite being somewhat complicated. I guess I'll run the same manual tests for Korean as well, and also add more unit tests for these languages. But it should be good to use already.

patrick-wilken marked this pull request as ready for review February 12, 2026 14:23
